
Conversation

@imback82
Contributor

@imback82 imback82 commented Aug 2, 2020

What changes were proposed in this pull request?

This PR proposes to make the behavior of the `path` option consistent when loading dataframes with a single path (e.g., `option("path", path).format("parquet").load(path)` vs. `option("path", path).parquet(path)`) by disallowing the `path` option to coexist with `load()`'s path parameters.

Why are the changes needed?

The current behavior is inconsistent:

scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")
                                                                                
scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
+-----+
|value|
+-----+
|    1|
+-----+


scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
+-----+
|value|
+-----+
|    1|
|    1|
+-----+

Does this PR introduce any user-facing change?

Yes. Now, if the `path` option is specified along with `load()`'s path parameters, the call fails:

scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")
                                                                                
scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
  at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
  ... 47 elided

scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
  at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:250)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:778)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:756)
  ... 47 elided

The user can restore the previous behavior by setting spark.sql.legacy.pathOptionBehavior.enabled to true.
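A hedged spark-shell sketch of the escape hatch described above (assuming the legacy config can be set at runtime; it can also be passed via `--conf` at launch):

```scala
// Restore the pre-3.1 behavior: allow the "path" option to coexist with
// load()'s path parameters (the old, inconsistent semantics return with it).
spark.conf.set("spark.sql.legacy.pathOptionBehavior.enabled", true)

// No longer throws AnalysisException.
spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show()
```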

How was this patch tested?

Added a test

@imback82
Contributor Author

imback82 commented Aug 2, 2020

I also see that the `path` option is treated differently depending on the number of paths passed to `load`. For example:

// For single path, "path" option is overridden with path given in "load"
scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
+-----+                                                                         
|value|
+-----+
|    1|
+-----+

// For multiple paths, "path" option is "added" to the paths given in "load"
scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test", "/tmp/test").show
+-----+
|value|
+-----+
|    1|
|    1|
|    1|
+-----+

Should I just fix this inconsistency instead of this PR? If so, what should be the expected behavior (just ignore path option if paths are passed to load or always add path to the given paths)?

cc: @cloud-fan @dongjoon-hyun @viirya (Thanks in advance!)

@SparkQA

SparkQA commented Aug 2, 2020

Test build #126945 has finished for PR 29328 at commit 63f3d60.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@imback82 imback82 changed the title [SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs [WIP][SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs Aug 2, 2020
@imback82
Contributor Author

imback82 commented Aug 3, 2020

FYI, the existing test "SPARK-11544 test pathfilter" fails if I change

assert(spark.read.options(extraOptions).json(path).count() === 2)

to

assert(spark.read.options(extraOptions).format("json").load(path).count() === 2)

We expect the same behavior between these two APIs, right?

@SparkQA

SparkQA commented Aug 3, 2020

Test build #126949 has finished for PR 29328 at commit 296d4bb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft August 3, 2020 21:45
@SparkQA

SparkQA commented Aug 6, 2020

Test build #127120 has finished for PR 29328 at commit 17c94a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 6, 2020

Test build #127118 has finished for PR 29328 at commit 9b6be0c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 6, 2020

Test build #127121 has finished for PR 29328 at commit 4c6ada7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@imback82 imback82 marked this pull request as ready for review August 6, 2020 19:30
@imback82 imback82 changed the title [WIP][SPARK-32516][SQL] 'path' option should be treated consistently when loading dataframes for different APIs [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters Aug 6, 2020
@SparkQA

SparkQA commented Aug 6, 2020

Test build #127153 has finished for PR 29328 at commit cd15fe2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2020

Test build #127346 has finished for PR 29328 at commit 808b7c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2020

Test build #127350 has finished for PR 29328 at commit 63c9383.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM except one question: do we still need this fix? https://github.com/apache/spark/pull/29328/files#r468330253

@imback82
Contributor Author

imback82 commented Aug 13, 2020

LGTM except one question: do we still need this fix? https://github.com/apache/spark/pull/29328/files#r468330253

Yes, without it, the following tests fail:

 org.apache.spark.sql.execution.datasources.csv.CSVLegacyTimeParserSuite.SPARK-23846: schema inferring touches less data if samplingRatio < 1.0
 org.apache.spark.sql.execution.datasources.csv.CSVv1Suite.SPARK-23846: schema inferring touches less data if samplingRatio < 1.0
 org.apache.spark.sql.execution.datasources.csv.CSVv2Suite.SPARK-23846: schema inferring touches less data if samplingRatio < 1.0
 org.apache.spark.sql.execution.datasources.json.JsonLegacyTimeParserSuite.SPARK-11544 test pathfilter
 org.apache.spark.sql.execution.datasources.json.JsonLegacyTimeParserSuite.SPARK-23849: schema inferring touches less data if samplingRatio < 1.0
 org.apache.spark.sql.execution.datasources.json.JsonV1Suite.SPARK-11544 test pathfilter
 org.apache.spark.sql.execution.datasources.json.JsonV1Suite.SPARK-23849: schema inferring touches less data if samplingRatio < 1.0
 org.apache.spark.sql.execution.datasources.json.JsonV2Suite.SPARK-11544 test pathfilter
 org.apache.spark.sql.execution.datasources.json.JsonV2Suite.SPARK-23849: schema inferring touches less data if samplingRatio < 1.0

Btw, this is an existing bug. For example, for "SPARK-11544 test pathfilter", if I change `spark.read.options(extraOptions).json(path)` to `spark.read.options(extraOptions).option("path", path).format("json").load()`, it would fail. The changes in this PR are surfacing the existing issue.

@imback82
Contributor Author

@cloud-fan, should we also make the behavior the same on the save path? Currently, the "path" option will be overwritten if a path is passed to save. Let me know if we need to enforce the same check (no coexistence), and I can do it either in this PR or as a follow-up PR. Thanks!

@cloud-fan
Contributor

Btw, this is an existing bug.

Can we have a separate PR for this bug? Then we can backport the bug fix.

should we also make the behavior the same on the save path?

This is a good point. We never document the overwrite behavior anywhere and people may not realize the path option is overwritten by the save path parameter. Let's do it in a new PR.


- In Spark 3.1, NULL elements of structures, arrays and maps are converted to "null" in casting them to strings. In Spark 3.0 or earlier, NULL elements are converted to empty strings. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.castComplexTypesToString.enabled` to `true`.

- In Spark 3.1, when loading a dataframe, `path` option cannot coexist with `load()`'s path parameters. For example, `spark.read.format("csv").option("path", "/tmp").load("/tmp2")` or `spark.read.option("path", "/tmp").csv("/tmp2")` will throw `org.apache.spark.sql.AnalysisException`. In Spark version 3.0 and below, `path` option is overwritten if one path parameter is passed to `load()`, or `path` option is added to the overall paths if multiple path parameters are passed to `load()`.
Member


Should paths be mentioned too?

Contributor Author

@imback82 imback82 Aug 20, 2020


I mentioned paths as well.


if ((extraOptions.contains("path") || extraOptions.contains("paths")) && paths.nonEmpty) {
throw new AnalysisException("There is a 'path' or 'paths' option set and load() is called " +
"with path parameters. Either remove the option or put it into the load() parameters.")
Member


Either remove the option or not put it into the load() parameters?

Contributor Author


I followed the suggestion from #29328 (comment), but let me know if the current message is still vague. Thanks!

Contributor


to make it clearer:

Either remove the path option if it's the same as the path parameter, or add it
to the load() parameter if you do want to read multiple paths.

paths = paths,
className = classOf[TextFileFormat].getName,
options = options.parameters
options = options.parameters - "path"
Member


Can you add some comments here at least? I feel it will confuse code readers later on why removing it...

@imback82 imback82 changed the title [SPARK-32516][SQL] 'path' option cannot co-exist with load()'s path parameters [SPARK-32516][SQL] 'path' option cannot coexist with load()'s path parameters Aug 20, 2020
@SparkQA

SparkQA commented Aug 20, 2020

Test build #127671 has finished for PR 29328 at commit ac007c0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 20, 2020

Test build #127674 has finished for PR 29328 at commit 62794e5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 20, 2020

Test build #127698 has finished for PR 29328 at commit d5c4203.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Sorry, I forgot about one thing: we need a legacy config to restore the previous behavior, in case users don't want to change their queries when upgrading Spark.

@imback82 can you add it?

// force invocation of `load(...varargs...)`
option("path", path).load(Seq.empty: _*)
if (sparkSession.sessionState.conf.legacyPathOptionBehavior) {
option("path", path).load(Seq.empty: _*)
Contributor Author


This is needed to restore the legacy behavior. Note that the legacy behavior is different for the following:

  • spark.read.option("path", path).parquet(path) => path is read twice
  • spark.read.format("parquet").option("path", path).load(path) => path is read once (option is overwritten)
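To spell out the two legacy code paths listed above, a sketch with the legacy flag enabled (behavior as described in this comment, not independently re-verified):

```scala
spark.conf.set("spark.sql.legacy.pathOptionBehavior.enabled", true)

// Convenience API: the "path" option and the parquet(path) argument are both
// kept, so the same file is read twice and rows are duplicated.
spark.read.option("path", "/tmp/test").parquet("/tmp/test").show()

// Generic API: load(path) overwrites the "path" option, so the file is read once.
spark.read.format("parquet").option("path", "/tmp/test").load("/tmp/test").show()
```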

@SparkQA

SparkQA commented Aug 23, 2020

Test build #127789 has finished for PR 29328 at commit 92bf5ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2020

Test build #127790 has finished for PR 29328 at commit e1decd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in e3a88a9 Aug 24, 2020
@cloud-fan
Contributor

@imback82 please open a PR to do the same check for DataFrameWriter, DataStreamReader/Writer, thanks!

@imback82
Contributor Author

Yes, will do. Thanks for the review!

cloud-fan pushed a commit that referenced this pull request Aug 27, 2020
…arameter for DataFrameWriter.save(), DataStreamReader.load() and DataStreamWriter.start()

### What changes were proposed in this pull request?

This is a follow up PR to #29328 to apply the same constraint where `path` option cannot coexist with path parameter to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.

### Why are the changes needed?

The current behavior silently overwrites the `path` option if a path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.

For example,
```
Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
```
will write the result to `/tmp/path2`.

### Does this PR introduce _any_ user-facing change?

Yes, if the `path` option coexists with a path parameter passed to any of the above methods, it will throw `AnalysisException`:
```
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a  path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
```

The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.

### How was this patch tested?

Added new tests.

Closes #29543 from imback82/path_option.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
dongjoon-hyun pushed a commit that referenced this pull request Sep 11, 2020
…der/Writer should not fail

### What changes were proposed in this pull request?

This is a followup of #29328

In #29328, we forbade the use case where both the path option and a path parameter are specified. However, it breaks some use cases:
```
val dfr =  spark.read.format(...).option(...)
dfr.load(path1).xxx
dfr.load(path2).xxx
```

The reason is that `load` has side effects: it sets the path option on the `DataFrameReader` instance. The next time you call `load`, Spark fails because both the path option and a path parameter are specified.

This PR removes the side effect of `save`/`load`/`start` so that they no longer set the path option.
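The reader-reuse pattern the follow-up restores can be sketched as follows (the `mergeSchema` option here is purely illustrative):

```scala
// Build one reusable reader...
val dfr = spark.read.format("parquet").option("mergeSchema", "true")

// ...and call load() more than once. After this follow-up, load(path) no
// longer stores "path" as an option on dfr, so the second call succeeds
// instead of tripping the coexistence check.
val df1 = dfr.load("/tmp/path1")
val df2 = dfr.load("/tmp/path2")
```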

### Why are the changes needed?

recover some use cases

### Does this PR introduce _any_ user-facing change?

Yes, some use cases fail before this PR, and can run successfully after this PR.

### How was this patch tested?

new tests

Closes #29723 from cloud-fan/df.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>